Quantifying the impact of positive stress on companies from online employee reviews

Workplace stress is often considered to be negative, yet lab studies on individuals suggest that not all stress is bad. There are two types of stress: distress refers to harmful stimuli, while eustress refers to healthy, euphoric stimuli that create a sense of fulfillment and achievement. Telling the two types of stress apart is challenging, let alone quantifying their impact across corporations. By leveraging a dataset of 440K reviews about S&P 500 companies published during twelve successive years, we developed a deep learning framework to extract stress mentions from these reviews. We proposed a new methodology that places each company on a stress-by-rating quadrant (based on its overall stress score and overall rating on the site), and accordingly scores the company to be, on average, either a low stress, passive, negative stress, or positive stress company. We found that (former) employees of positive stress companies tended to describe high-growth and collaborative workplaces in their reviews, and that such companies’ stock evaluations grew, on average, 5.1 times in 10 years (2009–2019) as opposed to the companies of the other three stress types that grew, on average, 3.7 times in the same time period. We also found that the four stress scores aggregated every year (from 2008 to 2020) closely followed the unemployment rate in the U.S.: a year of positive stress (2008) was rapidly followed by several years of negative stress (2009–2015), which peaked during the Great Recession (2009–2011). These results suggest that automated analyses of the language used by employees on corporate social-networking tools offer yet another way of tracking workplace stress, allowing quantification of its impact on corporations.


Description and evaluation of the deep-learning framework
To extract stress mentions, we used the MedDL entity extraction module 1 (the left rectangle in Figure 1(a)). MedDL uses contextual embeddings and a BiLSTM-CRF sequence labeling architecture. The BiLSTM-CRF architecture 2 is the deep-learning method commonly employed for accurately extracting entities from text 3,4, and consists of two layers. The first layer is a BiLSTM network (the dashed rectangle in Figure 1(a)), which stands for Bi-directional Long Short-Term Memory (LSTM). The outputs of the BiLSTM are then passed to the second layer: the CRF layer (enclosed in the other dashed rectangle). The predictions of the second layer (the white squares in Figure 1(a)) represent the output of the entity extraction module.
To extract the medical entities of symptoms and drug names, BiLSTM-CRF takes as input representations of words (i.e., embeddings). The most commonly used embeddings are Global Vectors for Word Representation (GloVe) 5 and Distributed Representations of Words (word2vec) 6. However, these do not take into account a word's context. The word 'pressure', for example, could be a stress symptom at the workplace (e.g., 'I felt constant pressure to deliver results') or could be used in the physics context (e.g., 'The solid material found in the centre of some planets at extremely high temperature and pressure'). To account for context, contextual embeddings are generally used. MedDL used RoBERTa embeddings, as they had outperformed several other contextual embeddings, including ELMo, BioBERT and Clinical BERT 1.
Our evaluation metric is the F1 score, which is the harmonic mean of precision P and recall R, that is, $F_1 = 2PR/(P+R)$, where
$$P = \frac{\#\,\text{correctly classified medical entities}}{\#\,\text{total entities classified as being medical}} \quad \text{and} \quad R = \frac{\#\,\text{correctly classified medical entities}}{\#\,\text{total medical entities}}.$$
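As an illustration of the metric, the following minimal sketch computes precision, recall, and F1 from gold and predicted entity spans, pooling counts over all documents (a micro-average). It is not MedDL's evaluation code: the function names, the representation of an entity as a `(start, end)` character span, and the rule that any character overlap counts as a partial match are all assumptions made for this example. The strict variant accepts only exact span matches, whereas the relaxed variant also accepts partial overlaps.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def prf1(gold_docs: List[Set[Span]], pred_docs: List[Set[Span]],
         relaxed: bool = False) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1: counts are pooled over documents."""
    correct_pred = n_pred = correct_gold = n_gold = 0
    for gold, pred in zip(gold_docs, pred_docs):
        n_pred += len(pred)
        n_gold += len(gold)
        if relaxed:
            # Assumption: a partial match is any character overlap.
            correct_pred += sum(any(overlaps(p, g) for g in gold) for p in pred)
            correct_gold += sum(any(overlaps(g, p) for p in pred) for g in gold)
        else:
            exact = len(gold & pred)  # identical (start, end) pairs only
            correct_pred += exact
            correct_gold += exact
    precision = correct_pred / n_pred if n_pred else 0.0
    recall = correct_gold / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: in "I have strong pain", the gold mention "strong pain" is (7, 18),
# while the model extracts only "pain", i.e. (14, 18).
gold_docs = [{(7, 18)}, {(0, 8), (20, 28)}]
pred_docs = [{(14, 18)}, {(0, 8)}]

strict_scores = prf1(gold_docs, pred_docs, relaxed=False)
relaxed_scores = prf1(gold_docs, pred_docs, relaxed=True)
print("strict :", strict_scores)
print("relaxed:", relaxed_scores)
```

In this toy example the partial match on "pain" is rejected by the strict variant and accepted by the relaxed one, so the relaxed scores are higher.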
For the strict F1 score, we counted as "correctly classified" only the entities that exactly matched the ground-truth labels. For the relaxed F1 score, partially matching entities were also counted as correctly classified (e.g., if the model extracts the entity "pain" given the full mention of "strong pain"). Also, given that our data comes with class imbalance (i.e., text tokens do not correspond equally to symptoms or non-medical entities), we corrected for that by computing P and R using micro-averages. In so doing, we were able to compare MedDL's F1 scores with those of two well-known entity extraction tools: MetaMap and TaggerOne. MetaMap is a well-established tool for extracting medical concepts from text using symbolic NLP and computational-linguistic techniques 7, and has become a de-facto baseline method for NLP studies related to health 8. TaggerOne is a machine learning tool using semi-Markov models to jointly perform two tasks: entity extraction and entity normalization. The tool does so using a medical lexicon 9.
Figure S4. MedDL strict/relaxed F1 score results when extracting medical symptoms on the MedRed dataset compared to two competitive alternatives of MetaMap and TaggerOne.
Annotations of the words BERTopic found. For each topic, we identified the three most representative words and submitted the reviews mentioning them to six annotators. For example, we picked three reviews containing the words 'overtime', 'mandatory', and 'shift' for negative stress companies, and asked six annotators to read them and describe what type of workplaces these reviews would suggest. Upon collecting a total of 72 free-form responses (i.e., each annotator described the reviews corresponding to the 12 topics), we conducted a thematic analysis 10. To identify overarching themes, we used a combination of open coding and axial coding. We first applied open coding to identify key concepts.
Specifically, one of the authors read the responses and marked them with keywords. We then used axial coding to identify relationships between the most frequent keywords to summarize them into semantically cohesive themes. We found three high-level themes: career drivers, industry or benefits, and emotional aspects. In the reviews, each theme was paraphrased differently depending on the four types of company stress, allowing us to identify sub-themes. The career drivers theme described what motivated employees to go to work. Its sub-themes concerned companies whose employees experienced 'considerable emotional pressure' (negative stress), tended to 'focus on activities outside the work' (passive), cherished 'their sense of control over their work' (low stress), and enjoyed 'a collaborative and supportive workplace culture' (positive stress). In the industry or benefits theme, we identified sub-themes mentioning either the industry sectors of the corresponding companies (e.g., Consumer Discretionary for negative stress, and Information Technology for positive stress) or aspects concerning long-term financial benefits (e.g., passive and low stress). Finally, in the emotional aspects theme, we identified sub-themes suggesting employees who experienced 'emotional pressure' (negative stress), 'tedious work' (passive), 'good work-life balance' (low stress), or a 'fast-paced, high-performing, and dynamic workplace environment' (positive stress).

Evaluation of BERTopic results
We ran the topic modeling algorithm BERTopic 11 separately on the four sets of reviews (each set containing reviews of the companies of a given stress type). The fact that BERTopic discovered distinct topics in the four sets suggests that stress is paraphrased differently in the sets. We calculated the topical overlap values for the different combinations of the four sets (using the Jaccard similarity on the sets of keywords from the top ten topics of each stress type), and found them to be, on average, as low as 0.08 (on a scale ranging from 0 to 1).
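The overlap computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the keyword sets are toy data (only 'overtime', 'mandatory', and 'shift' are taken from the text), and the average is taken over the six pairs of stress types, which is one possible reading of "combinations".

```python
from itertools import combinations
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A intersect B| / |A union B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(keywords_by_type: Dict[str, Set[str]]) -> float:
    """Average Jaccard similarity over all unordered pairs of stress types."""
    pairs = list(combinations(sorted(keywords_by_type), 2))
    total = sum(jaccard(keywords_by_type[x], keywords_by_type[y]) for x, y in pairs)
    return total / len(pairs)


# Toy keyword sets (hypothetical, for illustration only).
keywords_by_type = {
    "negative": {"overtime", "mandatory", "shift", "pressure"},
    "positive": {"growth", "team", "fast", "pressure"},
    "passive": {"boring", "shift", "benefits"},
    "low": {"balance", "benefits", "remote"},
}

print(round(mean_pairwise_jaccard(keywords_by_type), 3))
```

A value close to 0 means that the keyword sets of the four stress types share few words, while a value of 1 would mean that they are identical.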

Evaluation of the four quadrants
To test whether the quadrant division of companies into four types was meaningful, we manually inspected 30 posts taken at random from companies with high stress, and found stress mentions in companies with low ratings to be qualitatively different from those in companies with high ratings (e.g., a review from a lowly rated company "The pressure is constantly high, while your work is not appreciated [...] and it feels like the managers do not know what they are doing." versus a review from a highly rated company "Happy Employee. Best culture I have experienced, especially in a stressful job. [...] The job is hard, but nothing worth having comes easy."). Similarly, we found qualitatively different reviews between companies with low stress and high versus low ratings (e.g., a review from a highly rated company "Solid company offering Work From Home. [...] decent options to choose for hours worked, great tech support, all equipment supplied, always feel connected to team, strong work ethic.", versus a review from a lowly rated company "Sinking Ship due to Horribly Managed [...] Merger. At legacy X office, they managed to retain some of the positive company culture leftover from the X days. The people are still the best part of that office, but with the increasing turnover, layoffs and "Hunger Games" management style, that is in danger of ending..."). As a final validity check, we arranged companies along the two axes and clustered them in an unsupervised way. We found four to be the best number of clusters. More specifically, we applied k-means clustering, and searched for the optimal number of clusters using the elbow method (Figure S9). The method involves calculating the sum of squared distances between data points and their assigned cluster centroids for an increasing number of clusters k. Once this value stops decreasing significantly, the optimal number of clusters has been reached.
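The elbow search can be sketched in plain Python as below. This is an illustrative sketch, not the implementation used in the study: the data are synthetic (four groups of points on normalized stress and rating axes), the deterministic farthest-point initialization replaces the random initialization that k-means libraries typically use, and the stopping rule (`min_gain`, the minimum drop in the sum of squared distances expressed as a fraction of its value at k = 1) is an assumed way of formalizing "stops decreasing significantly".

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def sq_dist(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def init_centroids(points: List[Point], k: int) -> List[Point]:
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans_inertia(points: List[Point], k: int, max_iter: int = 100) -> float:
    """Run Lloyd's algorithm and return the sum of squared distances to centroids."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow(inertias: Dict[int, float], min_gain: float = 0.10) -> int:
    """Smallest k for which one more cluster lowers inertia by less than
    min_gain times the inertia at the smallest k (hypothetical rule)."""
    ks = sorted(inertias)
    for k, nxt in zip(ks, ks[1:]):
        if inertias[k] - inertias[nxt] < min_gain * inertias[ks[0]]:
            return k
    return ks[-1]


# Synthetic companies: four groups in the (stress score, rating) plane.
rng = random.Random(0)
centres = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
points = [(rng.gauss(cx, 0.05), rng.gauss(cy, 0.05)) for cx, cy in centres for _ in range(50)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 9)}
best_k = elbow(inertias)
print(best_k)
```

In practice the curve of `inertias` against k is usually inspected visually, as in Figure S9; the threshold rule above is only one way of automating that inspection.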

Sensitivity of the results
Weighting the scores. We explored the effect of weighting the yearly scores by plotting the temporal scores without weights, i.e., with w = 1. The result is shown in Figure S10. This simple aggregation skews the results towards (the long tail of) small companies, as it treats a small company as equal to a big one.
Shorter-term growth. To test whether our results on stock growth were affected by exogenous events such as the Great Recession [...] where $stock_i$ is the average adjusted closing price of their stocks in year $i$. Figure S5 shows that the trend remains qualitatively the same as that in Figure [...].
Interaction effects between stress scores and review ratings. We tested whether our observed stock growth was genuinely associated with positive stress companies rather than being simply associated with highly-rated companies. To this end, for each stress type, we plotted $\overline{GM}(stock\_growth_{[09-19]})$ against different rating percentiles (Figure S6). Highly rated companies experienced stock growth, yet there are still significant differences across companies of different stress types: in particular, positive stress companies of varying rating percentiles consistently enjoyed the highest growth (the yellow line in Figure S6 is consistently above the other three lines).
Growth per industry sectors. To test whether a specific industry sector is predominant for a given stress type, we first plotted the number of companies per industry sector according to the GICS classification (Figure S7). Information Technology was more prominent among positive stress and low stress companies, Health Care and Financials among negative stress ones, and Industrials and Consumer Discretionary among passive ones. To then check whether the distribution of industry sectors across the four types of stress affected our findings for stock growth, we computed stock growth between 2009 and 2019, and did so for the three most frequent industry sectors separately (i.e., Information Technology, Consumer Discretionary, and Health Care). We chose those three sectors because each individually contained a sufficient number of companies and, as such, allowed us to obtain statistically significant results. Stock growth was computed as $GM(stock\_growth_{[09-19]}) = \left(\prod_{c} stock\_growth_{[09-19]}(c)\right)^{1/n}$, where c is each company from a given industry sector (e.g., Information Technology) in a specific (stress type, percentile) bin, and n is the number of the companies in such a bin. For the three industry sectors, we plotted $\overline{GM}(stock\_growth)$ against different stress score percentiles (Figure S8). In all three sectors, we observed that positive stress companies had consistently higher stock growth compared to the other three stress types.
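The geometric mean per bin can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names and the toy records are hypothetical, each growth value is assumed to be a positive ratio of a company's stock price at the end of the period to its price at the start, and the product is computed through logarithms, which gives the same result as the n-th root of the product while avoiding overflow for large bins.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Bin = Tuple[str, int]  # (stress type, percentile)


def geometric_mean(values: Iterable[float]) -> float:
    """(v_1 * ... * v_n) ** (1 / n), computed as exp(mean(log v_i))."""
    vals = list(values)
    if not vals or any(v <= 0 for v in vals):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(math.fsum(math.log(v) for v in vals) / len(vals))


def gm_by_bin(companies: Iterable[Tuple[str, int, float]]) -> Dict[Bin, float]:
    """Group (stress_type, percentile, growth) records and average each bin."""
    bins: Dict[Bin, List[float]] = defaultdict(list)
    for stress_type, percentile, growth in companies:
        bins[(stress_type, percentile)].append(growth)
    return {key: geometric_mean(vals) for key, vals in bins.items()}


# Toy records for one sector (hypothetical growth ratios).
toy_sector = [
    ("positive", 50, 2.0),
    ("positive", 50, 8.0),
    ("negative", 50, 3.0),
]

gm = gm_by_bin(toy_sector)
print(gm)
```

The geometric mean is the natural average for growth ratios: in the toy bin above, growths of 2 and 8 average to 4, whereas the arithmetic mean would give 5 and overstate the typical company.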
Percentage of stress posts. To test the sensitivity of our results to the percentage of stress posts being considered, we repeated our analyses by including only the companies with at least r reviews. We found the optimal threshold r to be 280, and did so as follows. To include at least half of the total S&P 500 companies, the minimum number of reviews per company had to be less than r = 350. Then, for each r = 1, ..., 350, we selected the companies having at least r reviews and, for each such subset, calculated the correlation between a company's rating and its positive stress score (for positive stress companies) or its negative stress score (for negative stress companies). We found that the absolute values of the correlations increased with the number of reviews (Figure S11), as one would expect, and that there was a phase shift at r = 280 for positive stress companies (ρ(company_rating, positive_stress_association) = .75). The same applied to negative stress companies (Figure S11). At this threshold, we were left with 287 companies out of 380 companies in total. We repeated the calculations on this subset of companies and, compared to our previously reported results, found even stronger associations between: i) negative stress scores in the whole U.S. and the Great Recession, and ii) a company's positive stress score and its stock growth.
Figure S11. Threshold selection. Correlation values between each of the two stress scores and a company's website overall rating (y-axis) for the companies with at least r reviews (x-axis). These values have a phase shift at r = 280 for positive stress companies (blue), matching the value of the correlation for negative stress companies (red).
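The threshold sweep can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: the function names and the toy records are hypothetical, Pearson's coefficient is assumed (the text does not say which correlation ρ denotes), and subsets with fewer than three companies are skipped as an arbitrary safeguard. The sketch only produces the correlation curve; how the phase shift was located on that curve is not specified in the text and is not reproduced here.

```python
import math
from typing import Dict, List, Optional, Tuple

Company = Tuple[int, float, float]  # (number of reviews, rating, stress score)


def pearson(xs: List[float], ys: List[float]) -> Optional[float]:
    """Pearson correlation; None if either variable has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_by_threshold(companies: List[Company], r_max: int,
                             min_companies: int = 3) -> Dict[int, float]:
    """For each r in 1..r_max, correlate rating and stress score over the
    companies that have at least r reviews."""
    curve: Dict[int, float] = {}
    for r in range(1, r_max + 1):
        subset = [(rating, score) for n, rating, score in companies if n >= r]
        if len(subset) < min_companies:  # assumed safeguard, not from the paper
            continue
        rho = pearson([s[0] for s in subset], [s[1] for s in subset])
        if rho is not None:
            curve[r] = rho
    return curve


# Toy data (hypothetical): the two companies with few reviews are noisy.
toy_companies = [
    (10, 3.0, 0.9), (20, 4.5, 0.1),
    (300, 3.0, 0.2), (320, 3.5, 0.3), (400, 4.0, 0.4), (500, 4.5, 0.5),
]

curve = correlation_by_threshold(toy_companies, r_max=350)
print(curve[1], curve[300])
```

In the toy data the correlation rises once the sparsely reviewed companies are excluded, which mirrors the trend reported for Figure S11.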